TeaMs-RL: Teaching LLMs to Generate Better Instruction Datasets via Reinforcement Learning

Shangding Gu (shangding.gu@berkeley.edu), UC Berkeley & Technical University of Munich
Alois Knoll (knoll@in.tum.de), Technical University of Munich
Ming Jin (jinming@vt.edu), Virginia Tech

Abstract

The development of Large Language Models (LLMs) often confronts challenges stemming from the heavy reliance on human annotators in the reinforcement learning with human feedback (RLHF) framework, or the frequent and costly external queries tied to the self-instruct paradigm. In this work, we pivot to Reinforcement Learning (RL), but with a twist. Diverging from the typical RLHF, which refines LLMs following instruction data training, we use RL to directly generate the foundational instruction dataset that alone suffices for fine-tuning. Our method, TeaMs-RL, uses a suite of textual operations and rules, prioritizing the diversification of training datasets. It facilitates the generation of high-quality data without excessive reliance on external advanced models, paving the way for a single fine-tuning step and negating the need for subsequent RLHF stages. Our findings highlight key advantages of our approach: reduced need for human involvement and fewer model queries (only 5.73% of the strong baseline's total), along with enhanced capabilities of LLMs in crafting and comprehending complex instructions compared to strong baselines, and substantially improved model privacy protection. Code is available at: https://github.com/SafeRL-Lab/TeaMs-RL

1 Introduction

In the dynamic realm of Large Language Models (LLMs), there has been a pronounced migration of their capabilities into diverse sectors, from chat robots (OpenAI, 2023; Zhao et al., 2023a; Touvron et al., 2023a) and robotics (Ahn et al., 2022; Ren et al., 2023), to autonomous driving (Fu et al., 2023; Tang et al., 2023). Amidst this broad applicability, the capacity to train with targeted instructions and pertinent responses has been integral for optimizing performance. LLMs, such as GPT-3 (Brown et al., 2020), ChatGPT-4 (OpenAI, 2023), Llama-1 (Touvron et al., 2023a), and Llama-2 (Touvron et al., 2023b), are exemplars of this trend, showcasing enhanced capabilities when furnished with explicit human-generated instructions. Conventionally, this entailed considerable human input in both instruction creation and response generation, leading to expansive datasets for fine-tuning (Stiennon et al., 2020; Ouyang et al., 2022).

Emerging from predominantly human-instructed models, a crucial inquiry emerges: Can LLMs be fine-tuned to adeptly handle complex instructions without human feedback? The potential gains from this direction are multifaceted. Chief among them is the direct cut in costs tied to human annotations (Askell et al., 2021). Beyond the monetary aspect, such a transition also alleviates potential biases seeded by human curators (Gallegos et al., 2023). A loftier, albeit less conspicuous aim, is to amplify the quality of instructions and boost model performance. Several methodologies have surfaced in pursuit of these merits.
Figure 1: Comparative overview of LLM alignment techniques (left, red shaded region: the current paradigm of SFT on an instruction dataset followed by RLHF; right, green shaded region: TeaMs-RL, SFT on RL-enhanced instructions). Current methods typically involve a two-phase process, starting with Supervised Fine-Tuning (SFT) of a pre-aligned LLM using a dataset of human-crafted instructions and corresponding responses (often sourced from an expert LLM like ChatGPT), leading to a post-SFT LLM. This is then fine-tuned using RLHF, where human feedback on preferences is incorporated, resulting in a post-RLHF LLM. In contrast, our TeaMs-RL method employs a single-phase SFT approach, initially utilizing RL for teaching expert LLMs to generate high-quality instructions. We train an RL policy (instructor LLM) to create diverse instructions (with the diversity evaluated by a reviewer LLM as a reward signal). Once trained, the instructor LLM produces a set of actions to teach an expert LLM to generate high-quality instructions, and the instructions are leveraged to query the expert LLM to form the SFT instruction dataset. This approach capitalizes on the strengths of RL to enhance the complexity of instructions and consequently the value of responses from the expert LLM. Note that the expert LLMs are not involved in the training of the RL policy; we only use the instructor and the reviewer LLM for training the RL policy. The expert LLM is used to generate instructions and corresponding responses under the guidance of the trained RL policy.

The self-instruct method by Wang et al. (2022) pioneers this direction by using some external LLM to generate responses to human-generated seed instructions for instruction dataset curation. A notable recent development is the evolutionary strategy presented by Xu et al. (2024). Here, LLMs are seeded with initial instructions, gradually evolving towards generating more complex directives within predefined constraints. Despite their method's commendable performance relative to alternative models, it necessitates a multitude of interactions with an expert LLM (e.g., advanced LLMs such as ChatGPT), potentially raising concerns regarding resource demands.

In this research, we propose a novel method to improve instruction quality with the principles of Reinforcement Learning (RL) (Sutton & Barto, 2018), hence enhancing LLMs' ability to comprehend and effectively execute intricate instructions without human involvement, while also boosting model privacy protection. Our method, outlined in the green shaded region of Fig. 1, begins with training an instructor LLM (RL policy) for teaching an expert to generate diverse, complex instructions. These are then used to elicit expert LLM responses, forming a diversity-enhanced instruction dataset. The final stage of supervised fine-tuning (SFT) of a pre-alignment LLM (e.g., Llama-1) with this dataset enhances its complex task processing capabilities. Notably, our study shows that the thoughtful framing of instructions (using RL) is equally, if not more, crucial than generating responses from external sources such as expert LLMs or human feedback. Moreover, to clarify our methodology and facilitate replication, we provide a detailed example illustrating the inputs and outputs at each stage of our process, similar to Figure 1; see Appendix A.

The key benefits of our methodology include: Firstly, a reduction in the need for human instructors (e.g., annotators and evaluators), offering a more cost-effective alignment paradigm that supports the continuous development of capable and affordable LLMs.
Secondly, our approach moderates the need for frequent queries to external models, yielding monetary benefits and mitigating the environmental impact of already power-hungry data centers (Dhar, 2020; Wu et al., 2022). In the same experimental settings, our experimental results demonstrate that our method shows better performance than strong baselines such as WizardLM, while significantly reducing the number of expert LLM queries by over 94%. This lesser dependence on external data (for either SFT or RLHF) can potentially broaden the scope to other engineering and science disciplines where data availability is limited or data collection is costly. This aspect is also crucial in mitigating privacy concerns (Dong et al., 2022; Ko et al., 2023).

2 Related Work and Contribution

A multitude of studies have explored the training of language models using instructions paired with their respective responses. Notable works in this realm include GPT-3 (Brown et al., 2020), ChatGPT-4 (OpenAI, 2023), the Flan collection (Longpre et al., 2023), Flan models (Wei et al., 2021), and Alpaca (Taori et al., 2023). The prevalent methodology often requires human annotators to craft instructions and curate corresponding responses, leading to the assembly of detailed instruction-response datasets. Such datasets, exemplified by the one used in GPT-3's training (Brown et al., 2020), tend to resonate well with human preferences and markedly improve language model proficiency. However, (crowd) sourcing these datasets through human means can be not only costly but also prone to issues of bias (Gallegos et al., 2023).

Conversely, the self-instruct approach (Wang et al., 2022) charts a different course by tapping into the potential of LLMs to autonomously generate both instructions and their responses, facilitated by the provision of predefined seeds. While this strategy alleviates the dependence on human effort, it might not consistently capture the breadth and depth of diverse instructions and responses typically achieved with human annotators. Building on the pioneering self-instruct methodology, Xu et al. (2024) introduced WizardLM, an evolutionary instruction approach. In WizardLM, initial instructions drawn from the "Evol Alpaca" dataset are refined through the integration of command instructions and expert LLMs such as ChatGPT. Owing to its commendable performance in both formulating and adhering to intricate instructions, WizardLM has garnered considerable attention. However, its reliance on random sampling of command instructions could fall short in improving the breadth and richness of instructions fed to the LLMs. Furthermore, the heavy dependence on external expert models, commonplace in prevailing self-instruct methods, poses concerns not just economically, but also in terms of environmental impact.

Departing from these precedents, our methodology uniquely uses RL to help generate foundational fine-tuning instruction data rather than for post-tuning refinements (such as RLHF (Stiennon et al., 2020; Ouyang et al., 2022)). Our key technical contribution is the formulation of a Markov Decision Process (MDP) to train an instructor LLM as an RL policy, adeptly tailored for contextualized instruction manipulations, thereby enabling alignment through SFT on small, high-quality datasets.
In contrast to WizardLM's approach of treating text manipulations as ordinal selections, our MDP framework encodes them in a continuous action space, enabling a more refined differentiation of the intricate effects of textual manipulations (carried out by the instructor LLM) on the quality of instructions. Also, unlike the intricate task of crafting well-structured tree instructions required by Tree-Instruct (Zhao et al., 2023b), our MDP can be resolved using TRPO (Schulman et al., 2015) (with other common methods being applicable). This approach alleviates the combinatorial complexity inherent in sequential instruction actions, allowing for iterative policy refinement (with TRPO offering the added benefit of guaranteed monotonic progression). Empirically, we demonstrate the efficacy of SFT alignment on a small but high-quality dataset across benchmarks such as ARC and HellaSwag with foundational pre-aligned models such as Llama-1 and Llama-2 (refer to Section 4 for details). Additionally, comparative experiments on model privacy attacks reveal that our method markedly enhances model privacy protection (Section 4.3).

3 Method

We first train an instructor LLM policy (Sec. 3.1) based on a continuous action space encoding (Sec. 3.1.1) and diversity rules as a reward function (Sec. 3.1.2), a strategy designed to foster the generation of high-quality instructions with an expert LLM such as ChatGPT. Subsequently, using this policy and the expert LLM, we create an instruction-response dataset to fine-tune a pre-aligned LLM. Importantly, our enriched dataset allows bypassing the usual RLHF phase, resulting in a post-SFT LLM that is already proficient in responding to complex instructions. Also, the instruction policy is transferable for aligning various foundation LLMs such as Llama-1-chat-7b and Llama-2-chat-7b. Note that our method has been primarily tested on Llama models. There may be performance variations when applying our method to other models. This includes considerations regarding differences in model architecture and computational requirements, which could impact the effectiveness of our approach when transferred to models beyond Llama.

3.1 Training the Instructor LLM

Figure 2: RL policy search for LLM instruction generation, from an initial instruction to a final instruction. A denotes actions (prompts), S denotes states (instructions), and the green line indicates the optimal instruction generation achieved during the policy search.

To train the RL policy (i.e., the instructor LLM, which consists of an RL network and a WizardLM-13b model), we utilize the open-source WizardLM-13b model (https://github.com/nlpxucan/WizardLM, https://huggingface.co/WizardLM/WizardLM-13B-V1.2) as the reviewer LLM, which provides a diversity score for the current instruction set as the reward signal. WizardLM-13b, being a cost-effective option, can capture instruction nuances for reliable reward evaluation (Sec. 3.1.2).

The training procedure comprises several key stages. First, we select a single initial instruction, such as "How to cook food." This chosen instruction is then input into our instructor model. Second, we leverage the reviewer LLM to evaluate the instructor model's performance and provide reward signals. The policy training concludes if the rewards demonstrate convergence or the iteration limit is reached. Following this iterative training phase, we use the instructor LLM to generate complex instructions using a tailored action space (Sec. 3.1.1). In the final phase, the reviewer LLM assesses the diversity of the generated instructions to ensure instruction quality (Sec. 3.1.2).
To further illustrate how to leverage RL to train the instructor, we first model the RL process with a Markov Decision Process (MDP). An intuitive example is shown in Figure 2: it illustrates the RL policy search for LLM instruction generation, where A represents actions (prompts), S represents states (instructions), and the green path denotes the optimal instruction generation. An MDP is represented as $(S, A, P, r, \gamma)$, where $S$ denotes the state space, $A$ denotes the action space, the transition probability function is $P: S \times A \times S \to [0, 1]$, the reward function is denoted as $r: S \times A \to \mathbb{R}$, and $\gamma$ denotes the discount factor. In an MDP, we aim to maximize the expected cumulative reward values, which are defined as $Q^{\pi}(s,a) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t) \mid \pi, s_0 = s, a_0 = a\right]$ and $V^{\pi}(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t) \mid \pi, s_0 = s\right]$ for state-action pairs and states, respectively. Moreover, we have the advantage function $A^{\pi}(s,a) = Q^{\pi}(s,a) - V^{\pi}(s)$. Thus, the learning problem is formulated as Equation (1), where $\rho$ denotes an initial state distribution.

$$\max_{\pi \in \Pi} V^{\pi}(\rho) := \mathbb{E}_{s \sim \rho}\left[V^{\pi}(s)\right]. \qquad (1)$$

In our framework, reward values are assigned by reviewer LLMs, which assess the diversity of actions within a trajectory. A trajectory with highly diverse actions receives a higher reward. The state $s_t \in S$ on a trajectory $T$ represents the generative instructions. The action space $A$ is designed in Sec. 3.1.1, and specific reward settings are provided in Sec. 3.1.2. On the basis of these settings, we can employ a model-free RL algorithm (e.g., TRPO, Schulman et al. (2015)) to train the instructor LLM.

3.1.1 Action Set

Inspired by WizardLM (Xu et al., 2024), we leverage several actions to generate complex instructions, including "breadth action", "add constraint", "deepening", "concretizing", "increase reasoning steps", and "complicate input" (Xu et al., 2024). However, distinct from WizardLM, we map this discrete set into a continuous space by using a language encoding to represent each action, which is then used for a policy to generate instructions. This enables inherently capturing contextual nuances for direct comparison between actions via their Q values. The details of each action are in Appendix C.

3.1.2 Reward Settings

In this section, consider an evaluation prompt g. If g is characterized as "equal", the corresponding reward r is designated as 0. Otherwise, r is set to 1. The term "equal" here refers to semantic equivalence. It is noteworthy that the judgment is determined by the reviewer LLM. For instance, when conformity exists between the initial and subsequent instructions, the reward r is 0. In contrast, a lack of conformity yields a reward value of 1. For example, an evaluation prompt = """ Here are two Instructions to ChatGPT AI, do you think they are equal to each other, which meet the following requirements: 1. They have same constraints and requirements. 2. They have same depth and breadth of the inquiry. The First Instruction: """ + instruction + """. The Second Instruction: """ + state + """. Your Judgement (Must Just only answer: Equal or Not Equal. No need to explain the reason; 'Equal' and 'Not Equal' are not allowed to appear simultaneously.): """.
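The MDP above can be made concrete with a small environment wrapper. The following Python snippet is a minimal, illustrative sketch rather than the released implementation: apply_action and reviewer are assumed callables standing in for the instructor-side rewriting step and the reviewer LLM (WizardLM-13b), the evaluation prompt is abbreviated, and the continuous action encoding of Sec. 3.1.1 and the TRPO update are omitted.

    class InstructionEnv:
        """Sketch of the instruction-rewriting MDP of Sec. 3.1: states are instruction
        strings, actions are the six textual operations, and the reward is the reviewer
        LLM's Equal / Not Equal judgement from Sec. 3.1.2."""

        def __init__(self, apply_action, reviewer, initial_instruction, horizon=6):
            self.apply_action = apply_action    # callable(action_name, instruction) -> rewritten instruction
            self.reviewer = reviewer            # callable(prompt) -> reviewer-LLM completion
            self.initial = initial_instruction  # e.g., "How to cook food."
            self.horizon = horizon              # six actions per trajectory (Sec. 4.1)

        def reset(self):
            self.state, self.t = self.initial, 0
            return self.state

        def step(self, action_name):
            prev = self.state
            self.state = self.apply_action(action_name, prev)
            # Reward settings (Sec. 3.1.2): r = 0 if the reviewer judges the two
            # instructions semantically "Equal", r = 1 otherwise. The evaluation
            # prompt is abbreviated here; the full prompt is given above.
            verdict = self.reviewer(
                "Here are two Instructions to ChatGPT AI, do you think they are equal "
                f'to each other ... The First Instruction: "{prev}". '
                f'The Second Instruction: "{self.state}". '
                "Your Judgement (Must Just only answer: Equal or Not Equal.): ")
            reward = 1.0 if "not equal" in verdict.lower() else 0.0
            self.t += 1
            return self.state, reward, self.t >= self.horizon

A trajectory that keeps producing instructions the reviewer deems "Not Equal" to their predecessors accumulates reward, which is what drives the policy toward diverse, progressively more complex instructions.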
3.2 Fine-tune a Pre-Aligned LLM

As illustrated in Fig. 1, an approach is adopted to optimize cost-efficiency considerations during the phase dedicated to fine-tuning a pre-aligned LLM. Specifically, we use the initial instructions from the Alpaca dataset. The fine-tuning process is executed as follows: We first employ the Alpaca dataset (https://huggingface.co/datasets/tatsu-lab/alpaca) as the initial set of instructions and input the initial instructions into an expert LLM like ChatGPT. Next, we apply our trained instruction generation policy (derived from the instructor LLM) to guide the expert LLM (ChatGPT) in generating high-quality instructions. Specifically, this policy acts as a teacher directing the expert LLM to produce instructions that meet our quality criteria. The instructor LLM provides different responses during the RL training process. The expert LLM is used to generate instructions and their associated responses; it serves as the source of knowledge and expertise. Then, the expert LLM produces responses to these high-quality instructions. With the instructions and responses in hand, we fine-tune the pre-aligned LLMs like Llama-1-chat-7b and Llama-2-chat-7b via supervised fine-tuning (SFT), a cost-effective way to enhance LLM capabilities. Finally, we have the post-SFT LLMs, such as TeaMs-RL-1-7b.

Algorithm 1 The Pipeline of training LLMs.
1: Design a set A of actions and reward settings.
2: With reviewer LLMs' evaluation, leverage the principle of RL and an instructor LLM to search for a policy π.
3: Utilize the trained policy to teach expert LLMs to generate high-quality instructions and corresponding responses.
4: Fine-tune a pre-aligned LLM with the generated instructions and corresponding responses.

3.3 Practical Algorithms

The practical training process is outlined in Algorithm 1, which describes the training pipeline. A more detailed algorithm can be found in Algorithm 2, provided in Appendix B. In our training process, in order to teach expert LLMs how to generate high-quality instructions, we leverage the RL policy we have trained to prompt expert LLMs to generate instructions step by step. Building on the training process, we fine-tuned an LLM called "TeaMs-RL". Note that we have the flexibility to choose any other RL methods for policy training; here we choose TRPO (Schulman et al., 2015) (TRPO is not the only choice), though it has longer training times versus alternatives like Proximal Policy Optimization. We select TRPO for its rigorous advantage function handling and theoretical guarantees of monotonic improvement.
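To illustrate step 3 of Algorithm 1, the following sketch rolls out the trained policy to teach an expert LLM and collect the SFT dataset. It assumes generic policy, expert, and action_prompt helpers (an expert call returns a completion string, and action_prompt prepends the Appendix C template for the chosen action; a sketch of such a helper appears at the end of Appendix C). None of these names come from the released code.

    def build_sft_dataset(policy, expert, action_prompt, seed_instructions, steps_per_seed=6):
        """Step 3 of Algorithm 1: use the trained instructor policy to teach an
        expert LLM to rewrite seed instructions, then collect the expert's responses."""
        dataset = []
        for instruction in seed_instructions:                # e.g., initial Alpaca instructions
            for _ in range(steps_per_seed):
                action = policy.select_action(instruction)   # one of the six textual operations
                prompt = action_prompt(action, instruction)  # Appendix C template + current instruction
                instruction = expert(prompt)                 # expert LLM (e.g., ChatGPT) rewrites it
            response = expert(instruction)                   # expert LLM answers the final instruction
            dataset.append({"instruction": instruction, "output": response})
        return dataset

The resulting list of instruction-response pairs is exactly the material consumed by step 4 of Algorithm 1 (the SFT stage).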
4 Experiments

4.1 Enhancement of Instruction Diversity

We deploy a policy that is designed to orchestrate a trajectory consisting of six distinct instruction actions, in accordance with the training process, to facilitate the generation of instructions by expert LLMs. Note that the expert LLMs utilized in our study, as well as those used by WizardLM, are ChatGPT-3.5 and ChatGPT-4. Both models were accessed in 2023, ensuring that we employed the same versions of the ChatGPT-3.5 and ChatGPT-4 models to generate instructions and responses.

Figure 3: Comparison of our method with WizardLM in terms of data diversity (average episode reward over environment steps; TeaMs-RL (ours) vs. WizardLM).

A noteworthy aspect of our investigation involves a comparative assessment of the data quality resulting from the utilization of our policy, in contrast to the approach adopted by WizardLM (Xu et al., 2024), which relies upon random sampling for data generation by querying expert LLMs. It bears emphasizing that our instruction actions are randomly initialized, but as training progresses, our policy iteratively learns to enable the instructor LLM to produce increasingly complex and diverse instructions. As shown in Fig. 3, our method reliably enhances the diversity score of the instruction set. (In designing instruction actions, the "breadth action" is a single action regarding breadth thoughts; thus, we insert this action into the middle of our trajectory to enhance breadth instructions after training a policy.) The main computation overhead is learning the instruction policy on the relatively small WizardLM-13b in less than 1 hour on 2 NVIDIA RTX A6000 GPUs (with 896 total queries). This results in a transferable policy that reduces alignment cost across models compared to RLHF's per-model RL. Offloading to policy learning provides an instruction set for joint tuning and alignment, a substantial benefit over tuning-only data usage. For detailed information on how to teach expert LLMs to generate instructions, please refer to Appendix D.

4.2 Comparison with WizardLM-7b on ARC and HellaSwag Benchmarks

To comprehensively examine the effectiveness of our method, we carry out experiments on popular benchmarks: (1) AI2 Reasoning Challenge (ARC) benchmark (Clark et al., 2018): The benchmark introduces a fresh question set and baselines, all strategically curated to foster AI research in the realm of advanced question answering, setting a significantly higher bar for knowledge and reasoning capabilities compared to previous challenges. (2) HellaSwag benchmark (Zellers et al., 2019): The benchmark introduces a challenging dataset, revealing that even state-of-the-art models struggle with commonsense inference, as evidenced by the significant performance gap between humans (95% accuracy) and models (48%), achieved through adversarial filtering, a robust data collection paradigm that selects adversarial machine-generated wrong answers by scaling up the length and complexity of dataset examples to a 'Goldilocks' zone where the text generated is absurd to humans yet often misclassified by models.

Figure 4: Comparison of our method with WizardLM 7B on LM-Eval benchmarks (the higher the value, the better the method's performance). TeaMs-RL-7b-v1.1 (ours) vs. WizardLM-7b: overall average 65.73 vs. 62.9, ARC 54.35 vs. 50.17, HellaSwag 77.11 vs. 75.6.

Figure 5: Comparison of our method with WizardLM 7B on (a) dataset size used for training LLMs (17,878 vs. 250,000 instructions) and (b) querying count of advanced LLMs (35,756 vs. 624,000 ChatGPT queries); the lower the value, the better the method's performance.

We trained a llama-1-7b model denoted "TeaMs-RL-7b-v1.1" with our dataset of 17,878 instruction-response pairs; the training time is about 2 hours on 4 NVIDIA RTX A6000 GPUs.
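For readers who want to reproduce this kind of run, the following is a minimal sketch of supervised fine-tuning on such instruction-response pairs with the Hugging Face Trainer. The checkpoint name, data file path, formatting template, and hyperparameters are illustrative assumptions rather than the authors' released configuration, and prompt-loss masking is omitted for brevity.

    from datasets import load_dataset
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer, TrainingArguments)

    BASE = "huggyllama/llama-7b"  # assumed checkpoint name for a llama-1-7b base model
    tokenizer = AutoTokenizer.from_pretrained(BASE)
    tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(BASE)

    # Hypothetical JSON file of {"instruction": ..., "output": ...} pairs from the pipeline above.
    data = load_dataset("json", data_files="teams_rl_pairs.json")["train"]

    def format_and_tokenize(example):
        text = f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['output']}"
        return tokenizer(text, truncation=True, max_length=2048)

    tokenized = data.map(format_and_tokenize, remove_columns=data.column_names)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="teams-rl-7b", num_train_epochs=3,
                               per_device_train_batch_size=4, learning_rate=2e-5, bf16=True),
        train_dataset=tokenized,
        # mlm=False yields standard next-token (causal LM) labels for the whole sequence.
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()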
As shown in Figures 4 and 5, our method shows superior performance over WizardLM-7b models (https://huggingface.co/TheBloke/wizardLM-7B-HF) in the same experimental settings. It is crucial to highlight an important aspect of our methodology in relation to data utilization. The dataset employed for training our model is approximately one-fourteenth the size of the dataset utilized by WizardLM, as illustrated in Figure 5 (a). Furthermore, Fig. 5 (b) highlights the discernible difference in the query count posed to ChatGPT models between our method and WizardLM, with the latter requesting ChatGPT for responses more than seventeen times as often as TeaMs-RL. This marked contrast underscores the cost-effectiveness of our data generation approach, which mitigates the expenses associated with dataset acquisition and generates high-quality data. It highlights that our method is a more economically viable and sustainable strategy for training LLMs. For more experimental results, please see Appendix G. Note that the ChatGPT models we used to query data show performance equivalent to the ChatGPT models used in the baselines.

In our comparison experiments, we take the same settings: 25 shots for ARC, 10 shots for HellaSwag. In our experimental setup, all models are configured with a float16 format. We compare to WizardLM-7b since both this approach and ours use llama-1-7b as the base model. Notably, WizardLM-7B queried ChatGPT 624,000 times for responses, whereas our method queried the open-source WizardLM-13B 896 times during policy training and ChatGPT 35,756 times during high-quality data generation; our total queries are substantially fewer. Therefore, we believe the comparison is fair in terms of matched base model and vastly lower query amount. Even counting the queries to WizardLM-13B, for a total of 35,756 + 896 = 36,652 queries, our method still reduces the query count by 94.13% while reaching comparable outcomes, emphasizing a more economical and sustainable strategy for LLM training.

4.3 Model Privacy Attack Experiments

AI safety has increasingly adopted data synthesizers designed to produce differentially private datasets to mitigate the risk of inadvertent data leakage (Dong et al., 2022). These datasets serve as the foundational element for training machine learning algorithms. However, this presents a dilemma wherein practitioners have to choose between large training data and data privacy. Addressing this problem, our method tries to handle the dilemma of large training data and data privacy. It aims to enhance the performance of training models while ensuring improved data privacy, even with limited data.

Figure 6: Privacy attacks on the model (membership inference ROC): our model demonstrates strong privacy protection performance (TeaMs-RL ROC curve, AUC = 0.47, vs. baseline ROC curve, AUC = 0.72, and the random-guess diagonal). The more closely the ROC curve of the model aligns with random guessing, and the closer the AUC value of the model approaches 0.5, the stronger the indication of improved privacy protection by the model.

As illustrated in Fig. 6, we have conducted a series of membership inference attack (Shokri et al., 2017; Carlini et al., 2022) experiments to assess our model's privacy performance rigorously. Our model exhibits a Receiver Operating Characteristic (ROC) curve that closely approximates random guessing, yielding an Area Under the Curve (AUC) value of 0.47 in this evaluation. Conversely, the baseline model, trained on a dataset comprising 44,000 samples, displays a notable deviation from random guessing, with an AUC value of 0.72. In model privacy assessment, a closer proximity of the ROC curve to the random guess curve indicates better model privacy protection performance, while an AUC value approaching 0.5 further suggests better model privacy protection (Ye et al., 2022). The experimental results demonstrate the substantial enhancement in model privacy protection performance achieved by our model relative to the baseline model.
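The ROC/AUC numbers above can be computed mechanically once per-example membership-inference scores are available. The scikit-learn sketch below shows the computation with synthetic placeholder scores; the arrays are illustrations only, not the paper's attack outputs.

    import numpy as np
    from sklearn.metrics import roc_curve, auc

    # Placeholder attack scores: higher means the attacker believes the example was in training.
    # In a real evaluation these would come from a membership-inference attack (e.g., loss-based).
    rng = np.random.default_rng(0)
    member_scores = rng.normal(0.1, 1.0, size=1000)      # examples actually in the training set
    nonmember_scores = rng.normal(0.0, 1.0, size=1000)   # held-out examples

    labels = np.concatenate([np.ones_like(member_scores), np.zeros_like(nonmember_scores)])
    scores = np.concatenate([member_scores, nonmember_scores])

    fpr, tpr, _ = roc_curve(labels, scores)
    print(f"membership-inference AUC = {auc(fpr, tpr):.2f}")  # ~0.5 indicates near-random guessing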
4.4 Comparison Experiments on AlpacaEval Benchmarks

To comprehensively evaluate the effectiveness of our method, we carried out additional experiments using the AlpacaEval benchmark (https://github.com/tatsu-lab/alpaca_eval), a comprehensive tool for evaluating model performance across a diverse range of tasks. As shown in Table 1, the results of these experiments demonstrate that our model attains a win rate of 13%. In comparison, the baseline model, WizardLM-7b (Xu et al., 2024), achieves a win rate of only 4.84%. The experimental results further confirm that our method surpasses the performance of WizardLM (Xu et al., 2024). For the details of the AlpacaEval leaderboard, please see the link: https://tatsu-lab.github.io/alpaca_eval/.

Method                 AlpacaEval benchmark (win rate)
TeaMs-RL-1-7b (ours)   13%
WizardLM-7b            4.84%

Table 1: Comprehensive evaluation experiments on the AlpacaEval benchmark: our model achieved a win rate of 13%, significantly higher than the baseline model, WizardLM-7b, which recorded a win rate of 4.84%.

4.5 Comparison Experiments of Solving General Tasks

In this evaluation, we present scenarios where random tasks are provided, and the subsequent response is generated by our model (TeaMs-RL-1-7b) and baseline models. Notably, our model performs considerably better than the baseline models in providing a well-structured and accurate response to the given instruction. More cases are provided in Appendix H and Appendix I.

Task: "Please suggest two examples of animals that fly."

TeaMs-RL-1-7b output: "Certainly, here are two examples of animals that fly: 1. Birds: Birds are one of the most well-known examples of animals that fly. They have evolved wings that are lightweight, flexible, and covered with feathers. Birds can fly by flapping their wings or gliding through the air using air currents. They can also soar in the air without flapping their wings, which is called sustained flight. Some birds, like the albatross, can even fly for days without stopping. 2. Insects: Insects, such as bees, wasps, and moths, are also capable of flight. They have wings that are lightweight and covered with a thin layer of skin called cuticle. Insects fly by beating their wings, which creates lift and propels them through the air. Some insects, like moths, can also hover in the air by adjusting the angle of their wings. These are just two examples of animals that fly. There are many more, such as bats, pterosaurs, and even some species of fish that can glide through the water using their fins."

llama-1-7b output: "I'm sorry, I don't understand the question."

WizardLM-7b output: "Certainly! Two examples of animals that fly are birds and bats. Birds are known for their ability to fly using wings, while bats use wings to glide through the air. Both are amazing creatures that have evolved to navigate the skies. Is there anything else I can help you with?"

4.6 Comparison Experiments of Solving a Math Problem

We conduct comparison experiments with several strong baseline models on mathematical tasks. The experimental procedure entails inputting instructions into LLMs, with the LLMs subsequently generating responses aligned with the provided instructions. The empirical findings from our study reveal that the post-alignment LLM, TeaMs-RL-7b, exhibits superior mathematical capabilities compared to the baseline models. This enhanced performance is noteworthy, particularly when it surpasses even those models with a larger parameter count, such as the 13b models. This outcome underscores the efficacy of our alignment and training methodologies in augmenting the specific skill set of LLMs, particularly in mathematical problem-solving. More results are introduced in Appendix J.

Task: "Given that $f(x) = 5x^3 - 2x + 3$, find the value of $f(2)$."

TeaMs-RL-7b output: "Hello there! I'm happy to help you with your question. To find the value of f(2), we need to plug 2 into the function f(x) = 5x^3 - 2x + 3. So, f(2) = 5(2)^3 - 2(2) + 3 = 5(8) - 2(2) + 3 = 40 - 4 + 3 = 39. Therefore, the value of f(2) is 39. I hope this helps! Let me know if you have any other questions." The output is correct.

WizardLM-7b output: "Sure, I can help you with that! To find the value of f(2), we simply need to substitute x = 2 into the function and simplify: f(2) = 5(2)^3 - 2(2) + 3; f(2) = 30 - 4 + 3; f(2) = 21. Therefore, the value of f(2) is 21. Is there anything else I can help you with?" The output is wrong.

Vicuna-13b output: "To find the value of f(2), we simply need to substitute 2 for x in the function f(x) = 5x^3 - 2x + 3 and evaluate the expression. So, f(2) = 5(2)^3 - 2(2) + 3 = 5(4) - 2(2) + 3 = 20 - 4 + 3 = 13. Therefore, the value of f(2) is 13." The output is wrong.

Llama-2-chat-13b output: "Hello! I'd be happy to help you with that! To find the value of f(2), we need to plug 2 into the function f(x) = 5x^3 - 2x + 3. First, let's evaluate the expression inside the parentheses: 5x^3 - 2x + 3. When x = 2, we have: 5(2)^3 - 2(2) + 3 = 5(8) - 2(2) + 3 = 40 - 4 + 3 = 36. So, f(2) = 36. Is there anything else you'd like to know?" The output is wrong.

ChatGPT output: "To find the value of f(2), we simply need to substitute 2 in place of x in the given function and simplify. f(2) = 5(2)^3 - 2(2) + 3; f(2) = 5(8) - 4 + 3; f(2) = 40 - 1; f(2) = 39. Therefore, the value of f(2) is 39." The output is correct.
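As a quick sanity check of the arithmetic in the example above (an illustrative snippet, not part of the paper's evaluation):

    def f(x):
        return 5 * x**3 - 2 * x + 3

    print(f(2))  # 5*8 - 4 + 3 = 39, matching the TeaMs-RL-7b and ChatGPT answers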
4.7 Compared to Llama-2-7b-chat Models

As depicted in Table 2, we conduct a comparative analysis of our model against the llama-2-chat-7b model, which is trained using RLHF. The experimental findings indicate that our model, which is developed on the llama-1-7b framework, surpasses the llama-2-chat-7b in performance across both the ARC and Hellaswag benchmarks. Specifically, our model achieved a score of 78.35 on the Hellaswag benchmark and 55.89 on the ARC benchmark, which are markedly higher than the scores of 77.74 and 53.07, respectively, recorded by the llama-2-chat-7b on these benchmarks. This analysis underscores the superior efficacy of our approach.

Method                 ARC      Hellaswag
TeaMs-RL-1-7b (ours)   55.89    78.35
llama-2-chat-7b        53.07    77.74

Table 2: A comparison experiment regarding our model and a llama-2-chat-7b model: our model outperforms the llama-2-chat-7b on the ARC and Hellaswag benchmarks, scoring 55.89 and 78.35, respectively, versus 53.07 and 77.74 achieved by the llama-2-chat-7b.
4.8 Ablation Experiments Regarding Data Size

To further evaluate the effectiveness of our proposed method, we conduct ablation experiments to analyze the impact of dataset size, as summarized in Table 3. Expanding the dataset from 17,878 to 19,395 entries led to notable performance improvements for the TeaMs-RL-7b model on both the Hellaswag and ARC Challenge benchmarks. Specifically, the model's performance on the Hellaswag benchmark improves from 77.17 to 78.35, while on the ARC Challenge benchmark, it increases from 54.35 to 55.89. Further expanding the dataset from 19,395 to 21,396 entries yields more improvements: the Hellaswag benchmark score increases from 78.35 to 78.59, and the ARC Challenge score improves from 55.89 to 56.74. These results provide further confirmation of our method's effectiveness. It is worth noting that the performance improvements are not strictly linear with the increase in dataset size; however, adding more data generally proves beneficial for enhancing model performance.

Data Size   ARC      Hellaswag
17,878      54.35    77.17
19,395      55.89    78.35
21,396      56.74    78.59

Table 3: Ablation experiments regarding the impact of data size: expanding the dataset from 17,878 to 21,396 entries improves the performance of the TeaMs-RL-7b model on both the Hellaswag and ARC Challenge benchmarks.

5 Conclusion

In this study, we depart from the traditional RLHF paradigm and introduce a method that amplifies instruction quality while significantly cutting the costs linked to querying proprietary LLMs such as ChatGPT, a feat realized through strategic RL application in autonomous instruction set generation. Leveraging this methodology, we teach LLMs to generate high-quality data, refine a foundational model, and conduct comprehensive experiments to assess its efficacy. Remarkably, the LLM trained under this framework rivals the performance of the acclaimed WizardLM, despite being constrained by a significantly smaller dataset and fewer query instances: our dataset amounts to a mere 6.75% of WizardLM's, and the query counts to ChatGPT are only 5.73% of what WizardLM uses. This highlights the economic and sustainable advantages of our approach, underscoring its potential to enhance data quality within budgetary confines. Moreover, our method can improve model performance while effectively mitigating model privacy leakage risks. Our experiments clearly demonstrate substantially enhanced privacy protection over the baseline model.

Beyond mere practical implications, our findings question the conventional two-stage LLM training pipeline, suggesting that perhaps it is possible to train proficient LLMs without the necessity of human feedback. It beckons a rethinking of the pivotal role humans play in LLM training, urging a more judicious deployment of human resources to truly critical facets of the training process.

6 Limitations

Our method leverages RL to teach LLMs how to generate high-quality instruction fine-tuning data, aiming to diminish collection costs and the dependency on human alignment; however, it is important to note that this research did not explore the potential benefits of incorporating an additional stage of human feedback, which could potentially enhance alignment. This leaves open the question of whether data derived from expert LLMs sufficiently aligns with human values or inherits human value deficiencies, such as biases. While our approach successfully reduces the reliance on external models like expert LLMs, it does not entirely eliminate their use.
The full extent to which these models can be excluded remains an area for future exploration. Hence, while our work paves the way for greater autonomy and efficiency in data alignment, it simultaneously raises more questions than it answers, necessitating additional exploration. Moreover, the policy employed in instructing expert LLMs is trained using specific instructions, e.g., "How to cook food". This policy is then applied to guide expert LLMs in generating instructions from various other initial inputs. However, it is important to note that this policy may not be universally precise for all initial instructions. Training a distinct policy for each initial instruction presents a significant challenge, as it could be both time-consuming and costly. In the future, our research will delve into the complex interplay between these policies' performance and their training's associated costs. Additionally, we aim to explore the alignment of LLMs with human values, examining how these aspects can be balanced and optimized. This investigation is crucial for enhancing the efficiency and applicability of LLM training, ensuring both practicality and alignment with ethical standards.

References

Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, et al. Do as i can, not as i say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691, 2022.

Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph, Ben Mann, Nova DasSarma, et al. A general language assistant as a laboratory for alignment. arXiv preprint arXiv:2112.00861, 2021.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.

Nicholas Carlini, Steve Chien, Milad Nasr, Shuang Song, Andreas Terzis, and Florian Tramer. Membership inference attacks from first principles. In 2022 IEEE Symposium on Security and Privacy (SP), pp. 1897–1914. IEEE, 2022.

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018.

Payal Dhar. The carbon impact of artificial intelligence. Nat. Mach. Intell., 2(8):423–425, 2020.

Tian Dong, Bo Zhao, and Lingjuan Lyu. Privacy for free: How does dataset condensation help privacy? In International Conference on Machine Learning, pp. 5378–5396. PMLR, 2022.

Daocheng Fu, Xin Li, Licheng Wen, Min Dou, Pinlong Cai, Botian Shi, and Yu Qiao. Drive like a human: Rethinking autonomous driving with large language models. arXiv preprint arXiv:2307.07162, 2023.

Isabel O Gallegos, Ryan A Rossi, Joe Barrow, Md Mehrab Tanjim, Sungchul Kim, Franck Dernoncourt, Tong Yu, Ruiyi Zhang, and Nesreen K Ahmed. Bias and fairness in large language models: A survey. arXiv preprint arXiv:2309.00770, 2023.

Myeongseob Ko, Ming Jin, Chenguang Wang, and Ruoxi Jia. Practical membership inference attacks against large-scale multi-modal models: A pilot study. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4871–4881, 2023.

Shayne Longpre, Le Hou, Tu Vu, Albert Webson, Hyung Won Chung, Yi Tay, Denny Zhou, Quoc V Le, Barret Zoph, Jason Wei, et al.
The flan collection: Designing data and methods for effective instruction tuning. arXiv preprint arXiv:2301.13688, 2023.

OpenAI. Gpt-4 technical report, 2023.

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.

Allen Z Ren, Anushri Dixit, Alexandra Bodrova, Sumeet Singh, Stephen Tu, Noah Brown, Peng Xu, Leila Takayama, Fei Xia, Jake Varley, et al. Robots that ask for help: Uncertainty alignment for large language model planners. arXiv preprint arXiv:2307.01928, 2023.

John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pp. 1889–1897. PMLR, 2015.

Reza Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. Membership inference attacks against machine learning models. In 2017 IEEE Symposium on Security and Privacy (SP), pp. 3–18. IEEE, 2017.

Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F Christiano. Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33:3008–3021, 2020.

Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT Press, 2018.

Yun Tang, Antonio A Bruto da Costa, Jason Zhang, Irvine Patrick, Siddartha Khastgir, and Paul Jennings. Domain knowledge distillation from large language model: An empirical study in the autonomous driving domain. arXiv preprint arXiv:2307.11769, 2023.

Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b.

Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language model with self generated instructions. arXiv preprint arXiv:2212.10560, 2022.

Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. In International Conference on Learning Representations, 2021.

Carole-Jean Wu, Ramya Raghavendra, Udit Gupta, Bilge Acun, Newsha Ardalani, Kiwan Maeng, Gloria Chang, Fiona Aga, Jinshi Huang, Charles Bai, et al. Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems, 4:795–813, 2022.

Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, Qingwei Lin, and Daxin Jiang. WizardLM: Empowering large pre-trained language models to follow complex instructions. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=CfXh93NDgH.
Jiayuan Ye, Aadyaa Maddi, Sasi Kumar Murakonda, Vincent Bindschaedler, and Reza Shokri. Enhanced membership inference attacks against machine learning models. In Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security, pp. 3093–3106, 2022.

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 4791–4800, 2019.

Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. A survey of large language models. arXiv preprint arXiv:2303.18223, 2023a.

Yingxiu Zhao, Bowen Yu, Binyuan Hui, Haiyang Yu, Fei Huang, Yongbin Li, and Nevin L Zhang. A preliminary study of the intrinsic relationship between complexity and alignment. arXiv preprint arXiv:2308.05696, 2023b.

A A Methodological Example

To clarify our methodology and facilitate replication, we provide a detailed example illustrating the inputs and outputs at each stage of our process, similar to Figure 1 in our paper.

Stage 1: Policy Training
Components:
• RL Policy (derived from the Instructor Model): Learns to select actions based on the reviewer's feedback to improve the instructions.
• Reviewer Model: Provides feedback on the instructions.

Example:
1. RL Policy Training with any initial instruction, such as: "Describe the process of photosynthesis." The RL policy with Instructor Models learns to select actions that diversify instructions based on feedback from the reviewer model.
2. RL Policy Training: Based on the reviewer's feedback, the RL policy learns to select actions such as:
• Add Constraints
• Deep Reasoning
• Width Reasoning

Stage 2: RL Policy Action Selection
Selected Action: "Add Constraints"
Purpose: To make the instruction more challenging and comprehensive by adding specific constraints or requirements.

Stage 3: Guiding Expert LLMs
Purpose: To make the instruction more challenging and comprehensive by adding specific constraints or requirements.
RL Policy Action Selection such as: "Add Constraints"
The RL policy generates a specialized prompt based on this action to guide the expert LLM in rewriting the instruction.

Add Constraints: Action Prompt to Expert LLM:
""" I want you to act as a Prompt Rewriter. Your objective is to rewrite the given prompt into a more complex version to make it more challenging for AI systems like ChatGPT and GPT-4. Ensure that the rewritten prompt remains reasonable, understandable, and suitable for human response. Do not omit any non-text parts such as tables or code in the given prompt. Do not repeat conditions or requirements in your response, and do not disclose your role. Provide only the rewritten prompt without any introduction or explanation. The new prompt should not exceed 2048 words. You should complicate the given prompt by adding one more constraint or requirement. Try not to make the rewritten prompt verbose; you can only add or replace 10 to 20 words in the given prompt. Do not include phrases like 'Given Prompt' or 'Rewritten Prompt' in your response. Given Prompt: "Describe the process of photosynthesis." """

Stage 4: Expert LLM Generates Rewritten Instruction
Expert LLM Output: "Describe the process of photosynthesis in plants and explain how it varies in different environmental conditions."

Explanation:
• Original Instruction: Simple and straightforward.
• Rewritten Instruction: Adds the constraint of explaining variations in different environmental conditions, increasing complexity and depth.
• Compliance with Guidelines: The expert LLM added approximately 12 words, adhering to the limit of adding or replacing 10 to 20 words.

Once we have the generated instructions, we use them to query the expert model for corresponding responses. After obtaining the responses, we have the final dataset (instructions and corresponding responses), which is then used to fine-tune foundation models.

Summary
This example demonstrates how our RL policy guides expert LLMs to generate more complex and high-quality instructions by selecting appropriate actions. The process ensures that the generated instructions are challenging yet reasonable, facilitating the creation of a valuable dataset for training advanced AI models.

B Practical Algorithm

Here, we introduce Trust Region Policy Optimization (TRPO) (Schulman et al., 2015), which is outlined in Algorithm 3. In this context, $\rho_\pi$ represents the discounted state visitation frequencies, while $\epsilon = \max_{s,a} |A_\pi(s,a)|$ denotes the maximum absolute advantage function value across all states and actions. The performance objective, $\eta(\pi)$, is defined as the expected cumulative reward over time, $\eta(\pi) = \mathbb{E}_{s_0, a_0, \ldots}\left[\sum_{t=0}^{\infty} \gamma^{t} r(s_t)\right]$, where $\gamma$ is the discount factor.

Algorithm 2 TeaMs-RL: Teaching LLMs to Generate Better Instruction Datasets via RL.
1: Initial dataset D1, instruction dataset D2, response dataset D3, time step t, parameters θ, the number of instructions and responses n, RL batch size number x.
2: With reviewer LLMs' evaluation, leverage the principle of RL and an instructor LLM to search for a policy π.
3: for t ← 1 to T do
4:   Collect trajectories $D_t = \{\tau_x\}$ with policy $\pi_t$ in the reviewer LLM interactions.
5:   Compute advantage $A_t$ and estimate the gradient $g_k = \frac{1}{|D_t|} \sum_{\tau \in D_t} \sum_{t=0}^{\tau} \nabla_\theta \log \pi(a_t \mid s_t, \theta_t) A_t$.
6:   Use the methods of conjugate gradient and line search to search the policy parameters $\theta_{k+1}$ via TRPO.
7: end for
8: Receive the trained policy $\pi'$ that is parameterized by $\theta_T$.
9: for i ← 1 to n do
10:   Leverage the trained policy $\pi'$ to teach expert LLMs to generate high-quality instruction i, which can be used to build dataset D2 with the initial dataset D1.
11: end for
12: for j ← 1 to n do
13:   Utilize expert LLMs to query response j for the generated instruction i, which can be used to build dataset D3 with the initial dataset D2.
14: end for
15: Fine-tune a pre-aligned LLM with the generated instructions and corresponding responses.

Algorithm 3 TRPO (Schulman et al., 2015)
Initialize $\pi_0$.
for i = 0, 1, 2, ... until convergence do
  Compute all advantage values $A_{\pi_i}$.
  Solve the constrained optimization problem
    $\pi_{i+1} = \arg\max_{\pi} \left[ L_{\pi_i}(\pi) - C D^{\max}_{\mathrm{KL}}(\pi_i, \pi) \right]$,
  where $C = \frac{4\epsilon\gamma}{(1-\gamma)^2}$ and $L_{\pi_i}(\pi) = \eta(\pi_i) + \sum_{s} \rho_{\pi_i}(s) \sum_{a} \pi(a|s) A_{\pi_i}(s,a)$.
end for
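To make the gradient estimate in Algorithm 2 (step 5) concrete, the following PyTorch sketch accumulates the log-probability-weighted advantage objective whose gradient is g_k. It assumes policy is a torch.nn.Module mapping a state tensor to a torch.distributions.Categorical over the six actions and that advantages are precomputed; the conjugate-gradient step and KL-constrained line search of TRPO (Algorithm 3) are not shown, and none of these names come from the released code.

    import torch

    def estimate_policy_gradient(policy, trajectories):
        """Sketch of Algorithm 2, step 5: g_k ~ (1/|D_t|) * sum over trajectories and
        timesteps of grad_theta log pi(a_t | s_t, theta) * A_t."""
        surrogate = 0.0
        for traj in trajectories:                    # traj: list of (state, action, advantage)
            for state, action, advantage in traj:
                dist = policy(state)                 # assumed: returns a Categorical distribution
                surrogate = surrogate + dist.log_prob(torch.as_tensor(action)) * advantage
        surrogate = surrogate / max(len(trajectories), 1)
        # Gradients of `surrogate` w.r.t. the policy parameters give g_k; TRPO then applies
        # conjugate gradient and a KL-constrained line search on top of this (omitted here).
        surrogate.backward()
        return [p.grad for p in policy.parameters()]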
C Action Set

breadth action:
evol prompt = """I want you act as a Prompt Creator. Your goal is to draw inspiration from the #Given Prompt# to create a brand new prompt. This new prompt should belong to the same domain as the #Given Prompt# but be even more rare. The LENGTH and difficulty level of the #Created Prompt# should be similar to that of the #Given Prompt#. Don't repeat the conditions and requirements in the response, and Don't disclose your role. The Prompt Rewriter Must not give the introduction and explain the reason, the Prompt Rewriter must just give the most relevant response. This new prompt should not exceed 2048 words. The #Created Prompt# must be reasonable and must be understood and responded by humans. '#Given Prompt#', '#Created Prompt#', 'given prompt' and 'created prompt' are not allowed to appear in #Created Prompt#. #Given Prompt#: """ + instruction

add constraints:
evol prompt = """ I want you act as a Prompt Rewriter. Your objective is to rewrite a given prompt into a more complex version to make those famous AI systems (e.g., chatgpt and GPT4) a bit harder to handle. But the rewritten prompt must be reasonable and must be understood and responded by humans. Your rewriting cannot omit the non-text parts such as the table and code in #Given Prompt#:. Also, please do not omit the input in #Given Prompt#. Don't repeat the conditions and requirements in the response, and Don't disclose your role. The Prompt Rewriter Must not give the introduction and explain the reason, the Prompt Rewriter must just give the most relevant response. This new prompt should not exceed 2048 words. You SHOULD complicate the given prompt using the following method: Please add one more constraints/requirements into #Given Prompt# You should try your best not to make the #Rewritten Prompt# become verbose, #Rewritten Prompt# can only add or replace 10 to 20 words into #Given Prompt#. '#Given Prompt#', '#Rewritten Prompt#', 'given prompt' and 'rewritten prompt' are not allowed to appear in #Rewritten Prompt#. #Given Prompt#: """ + instruction

deepening:
evol prompt = """ I want you act as a Prompt Rewriter. Your objective is to rewrite a given prompt into a more complex version to make those famous AI systems (e.g., chatgpt and GPT4) a bit harder to handle. But the rewritten prompt must be reasonable and must be understood and responded by humans. Your rewriting cannot omit the non-text parts such as the table and code in #Given Prompt#:. Also, please do not omit the input in #Given Prompt#. Don't repeat the conditions and requirements in the response, and Don't disclose your role. The Prompt Rewriter Must not give the introduction and explain the reason, the Prompt Rewriter must just give the most relevant response. This new prompt should not exceed 2048 words. You SHOULD complicate the given prompt using the following method: If #Given Prompt# contains inquiries about certain issues, the depth and breadth of the inquiry can be increased. You should try your best not to make the #Rewritten Prompt# become verbose, #Rewritten Prompt# can only add 10 to 20 words into #Given Prompt#. '#Given Prompt#', '#Rewritten Prompt#', 'given prompt' and 'rewritten prompt' are not allowed to appear in #Rewritten Prompt#. #Given Prompt#: """ + instruction

concretizing:
evol prompt = """ I want you act as a Prompt Rewriter. Your objective is to rewrite a given prompt into a more complex version to make those famous AI systems (e.g., chatgpt and GPT4) a bit harder to handle. But the rewritten prompt must be reasonable and must be understood and responded by humans. Your rewriting cannot omit the non-text parts such as the table and code in #Given Prompt#:. Also, please do not omit the input in #Given Prompt#. Don't repeat the conditions and requirements in the response, and Don't disclose your role. The Prompt Rewriter Must not give the introduction and explain the reason, the Prompt Rewriter must just give the most relevant response. This new prompt should not exceed 2048 words. You SHOULD complicate the given prompt using the following method: Please replace general concepts with more specific concepts.
You should try your best not to make the #Rewritten Prompt# become verbose, #Rewritten Prompt# can only add 10 to 20 words into #Given Prompt#. '#Given Prompt#', '#Rewritten Prompt#', 'given prompt' and 'rewritten prompt' are not allowed to appear in #Rewritten Prompt#. #Given Prompt#: """ + instruction

increase reasoning steps:
evol prompt = """ I want you act as a Prompt Rewriter. Your objective is to rewrite a given prompt into a more complex version to make those famous AI systems (e.g., chatgpt and GPT4) a bit harder to handle. But the rewritten prompt must be reasonable and must be understood and responded by humans. Your rewriting cannot omit the non-text parts such as the table and code in #Given Prompt#:. Also, please do not omit the input in #Given Prompt#. Don't repeat the conditions and requirements in the response, and Don't disclose your role. The Prompt Rewriter Must not give the introduction and explain the reason, the Prompt Rewriter must just give the most relevant response. This new prompt should not exceed 2048 words. You SHOULD complicate the given prompt using the following method: If #Given Prompt# can be solved with just a few simple thinking processes, you can rewrite it to explicitly request multiple-step reasoning. You should try your best not to make the #Rewritten Prompt# become verbose, #Rewritten Prompt# can only add 10 to 20 words into #Given Prompt#. '#Given Prompt#', '#Rewritten Prompt#', 'given prompt' and 'rewritten prompt' are not allowed to appear in #Rewritten Prompt#. #Given Prompt#: """ + instruction

complicate input:
evol prompt = """ I want you act as a Prompt Rewriter. Your objective is to rewrite a given prompt into a more complex version using dataformat to make those famous AI systems (e.g., chatgpt and GPT4) more difficult to handle. But the rewritten prompt must be reasonable and must be understood and responded by humans. Don't repeat the conditions and requirements in the response, and Don't disclose your role. The Prompt Rewriter Must not give the introduction and explain the reason, the Prompt Rewriter must just give the most relevant response. This new prompt should not exceed 2048 words. The Given Prompt: """ + instruction
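For concreteness, these templates can be collected into a small lookup that the trained policy's selected action indexes into before querying the expert LLM, matching the action_prompt helper assumed in the pipeline sketch of Section 3.3. The template strings below are abbreviated, illustrative stand-ins for the full prompts above, not the exact strings used in the released code.

    ACTION_TEMPLATES = {
        # Abbreviated stand-ins for the full Appendix C prompts.
        "breadth": "I want you act as a Prompt Creator. ... create a brand new prompt in the "
                   "same domain but even more rare. ... #Given Prompt#: ",
        "add_constraint": "I want you act as a Prompt Rewriter. ... Please add one more "
                          "constraints/requirements into #Given Prompt#. ... #Given Prompt#: ",
        "deepening": "I want you act as a Prompt Rewriter. ... the depth and breadth of the "
                     "inquiry can be increased. ... #Given Prompt#: ",
        "concretizing": "I want you act as a Prompt Rewriter. ... Please replace general "
                        "concepts with more specific concepts. ... #Given Prompt#: ",
        "increase_reasoning": "I want you act as a Prompt Rewriter. ... rewrite it to explicitly "
                              "request multiple-step reasoning. ... #Given Prompt#: ",
        "complicate_input": "I want you act as a Prompt Rewriter. ... using dataformat ... "
                            "The Given Prompt: ",
    }

    def action_prompt(action_name, instruction):
        """Concatenate the selected action's template with the current instruction (state)."""
        return ACTION_TEMPLATES[action_name] + instruction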
D.1 Teaching ChatGPT-4 to Generate Imaginative Instructions

An initial instruction: "Describe the economic system of the United States."

Generated instructions by teaching ChatGPT-4 with RL policy:
(1) "Elucidate the economic framework prevalent in the United States, taking into account its relationship with environmental sustainability."
(2) "Examine the monetary structure dominating in the European Union, while considering its connection to ecological preservation."
(3) "Investigate the prevailing monetary framework within the European Union, taking into account its association with environmental conservation, and evaluate the impact of one additional factor, such as economic growth or political stability."

For more cases, please see Appendix E.

D.2 Teaching ChatGPT-3.5 to Generate Imaginative Instructions

An initial instruction: "Find a unique way to use a rubber band."

Generated instructions by teaching ChatGPT-3.5 with RL policy:
(1) "Discover an unconventional and singular application for an elastic band, ensuring that it adheres to environmentally friendly practices."
(2) "Develop a groundbreaking and ecologically considerate utilization for a stretchable loop, ensuring adherence to the principles of enduring viability, while elaborating on the prospective enduring environmental advantages it may offer."
(3) "Devise a groundbreaking, environmentally-conscious application for a scalable band, contemplating prolonged durability and sustainability, while elaborating on the enduring ecological advantages it could potentially offer."

For more cases, please see Appendix F.

E Generate Imaginative Instructions by Teaching ChatGPT-4

E.1 Instruction 1

An initial instruction: "Build a bridge between two islands."

Generated instructions:
(1) "Construct a bridge connecting two islands while ensuring it can withstand the environmental conditions and weight capacity requirements of the region."
(2) "Design a tunnel system that links two archipelagos, taking into consideration the geological features, pressure resistance, and necessary ventilation for safe travel."
(3) "Devise an interconnected subterranean network connecting two archipelagos, factoring in geological characteristics, pressure endurance, requisite airflow for secure passage, and minimal ecological disruption."
(4) "Design an interlinked underground system joining two island chains, incorporating geological features, resistance to pressure, adequate ventilation for safe traversal, and minimal environmental impact, while ensuring cost-efficiency."
(5) "Create a blueprint for a subaquatic tunnel network connecting multiple archipelagos, taking into account underwater terrain, structural integrity against water pressure, sufficient air circulation for travelers, reduced ecological disturbance, and budget-conscious construction."
(6) "Devise a schematic representation of an underwater tunnel system linking numerous island chains, factoring in the topography of the ocean floor, resilience to hydrostatic forces, adequate ventilation for commuters, minimized environmental disruptions, and cost-effective building methodologies."

E.2 Instruction 2

An initial instruction: "Describe the economic system of the United States."
Generated instructions:
(1) "Elucidate the economic framework prevalent in the United States, taking into account its relationship with environmental sustainability."
(2) "Examine the monetary structure dominating in the European Union, while considering its connection to ecological preservation."
(3) "Investigate the prevailing monetary framework within the European Union, taking into account its association with environmental conservation, and evaluate the impact of one additional factor, such as economic growth or political stability."

E.3 Instruction 3

An initial instruction: "Convert 28 cm to inches."

Generated instructions:
(1) "In order to convert 28 centimeters into inches, taking into account that 1 inch is equivalent to 2.54 centimeters, and provide the result with two decimal places of precision."
(2) "Calculate the length in inches of a 35-centimeter object, considering that 1 inch equals 2.54 centimeters, and present your answer rounded to two decimal places."
(3) "Determine the measurement in inches for an object with a length of 35 centimeters, taking into account the conversion factor of 1 inch being equivalent to 2.54 centimeters, and provide the resulting value with a precision of two decimal points."
(4) "Ascertain the dimensions in inches of an item possessing a length of 35 centimeters, factoring in the conversion ratio of 1 inch corresponding to 2.54 centimeters, and supply the consequent value, ensuring accuracy to a couple of decimal places."
(5) "Determine the weight in pounds of an object with a mass of 45 kilograms, considering the conversion rate of 1 kilogram equaling 2.20462 pounds, and provide the resulting value with precision up to two decimal points."
(6) "Calculate the weight in pounds of an item possessing a mass of 45 kg, taking into account the conversion factor of 1 kg being equivalent to 2.20462 lbs, and present the outcome with accuracy to two decimal places."

E.4 Instruction 4

An initial instruction: "Predict the weather in Los Angeles tomorrow."

Generated instructions:
(1) "Forecast the atmospheric conditions for the city of Los Angeles, California, specifically for the upcoming day."
(2) "Predict the weather patterns for San Francisco, California, focusing on the next 24 hours."
(3) "Forecast the atmospheric conditions for San Francisco, California, with a particular emphasis on the upcoming 24-hour period, taking into account temperature, precipitation, and wind patterns."
(4) "Predict the meteorological conditions in San Francisco, California, focusing specifically on the subsequent 24-hour timeframe, incorporating temperature, precipitation, wind patterns, and humidity levels."
(5) "Analyze the atmospheric conditions in Seattle, Washington, for the upcoming 48-hour period, concentrating on factors such as air pressure, cloud coverage, visibility, and dew point."
(6) "Examine and provide a detailed assessment of the meteorological conditions and atmospheric parameters in Seattle, Washington, for the subsequent 48-hour time frame. This analysis should predominantly focus on elements such as barometric pressure, the extent of cloud cover, range of visibility, and the dew point temperature."

E.5 Instruction 5

An initial instruction: "Tell me about the seasons in the temperate forest biome."
Generated instructions:
(1) "Elaborate on the distinct seasonal variations observed within the deciduous forest biome, encompassing climate and ecological adaptations."
(2) "Discuss the unique cyclic changes experienced in the coniferous forest ecosystem, focusing on weather patterns and the adaptive strategies of its flora and fauna."
(3) "Elaborate on the distinct cyclical transformations occurring within the coniferous forest ecosystem, emphasizing meteorological fluctuations and the specialized adaptation techniques employed by its plant and animal species."
(4) "Examine the unique cyclical transformations taking place in the coniferous forest ecosystem, with an emphasis on meteorological fluctuations, the specialized adaptation techniques employed by its plant and animal species, and the interdependence between biotic and abiotic factors."
(5) "Analyze the distinct seasonal changes occurring in the alpine tundra ecosystem, focusing on the influence of climatic variations, the specific survival strategies utilized by its flora and fauna, and the mutual relationships between living and non-living elements."
(6) "Examine the unique seasonal transformations in the alpine tundra biome, emphasizing the impact of climatic fluctuations, the specialized adaptation mechanisms employed by its plant and animal species, and the interdependent connections between biotic and abiotic components."

E.6 Instruction 6

An initial instruction: "Generate a list of items for a vegetarian Thanksgiving dinner."

Generated instructions:
(1) "Devise a strategy for a Thanksgiving feast that is not only economical but also incorporates at least one vegetarian dish to accommodate diverse dietary preferences."
(2) "Plan a budget-friendly Easter dinner that includes a vegan main course option to cater to various dietary needs."
(3) "Devise an economical Easter dinner plan, encompassing a vegan main course alternative and an additional gluten-free side dish, in order to accommodate a range of dietary preferences."
(4) "Construct an affordable Easter dinner strategy, incorporating a vegan main course option and an extra gluten-free side dish, while also ensuring the inclusion of a low-sodium appetizer, to cater to a variety of dietary inclinations."
(5) "Design a budget-friendly Easter meal plan that features a plant-based centerpiece and an additional side dish without gluten, as well as an appetizer low in salt content, to accommodate various dietary preferences."
(6) "Devise an economical Easter meal plan, encompassing a plant-based focal dish, an accompanying side free of gluten, and an appetizer with minimal sodium, while also incorporating a dessert that is devoid of nuts, catering to a wide array of dietary inclinations."

E.7 Instruction 7

An initial instruction: "Name two characters in the novel Moby Dick."

Generated instructions:
(1) "Identify a pair of individuals featured in Herman Melville's literary work, Moby Dick, while also specifying the roles they play within the narrative."
(2) "Examine two characters from Nathaniel Hawthorne's novel, The Scarlet Letter, and describe the significance of their roles in the story."
(3) "Delve into the intricacies of two characters from Nathaniel Hawthorne's literary work, The Scarlet Letter, and elucidate the importance of their contributions to the overarching narrative and themes within the novel."
(4) "Examine and dissect the complexities of a pair of characters featured in Nathaniel Hawthorne's eminent piece of literature, The Scarlet Letter.
Provide a comprehensive analysis of the significance of their individual and collective roles in augmenting the overarching storyline and the prevailing motifs embedded throughout the novel."
(5) "Analyze and explore the intricate dynamics between two characters from F. Scott Fitzgerald's renowned work, The Great Gatsby. Offer a thorough evaluation of how their individual and combined contributions enhance the central narrative and reinforce the dominant themes woven throughout the story."
(6) "Delve into the complex interplay between two characters from F. Scott Fitzgerald's acclaimed novel, The Great Gatsby, and provide an in-depth assessment of how their singular and mutual interactions augment the core storyline while bolstering the predominant motifs interspersed within the tale."

E.8 Instruction 8

An initial instruction: "Name 6 components of an artificial neural network."

Generated instructions:
(1) "Identify 6 distinct elements that constitute an artificial neural network, including various layers and associated features."
(2) "Enumerate half a dozen unique components that form the structure of an artificial neural network, encompassing diverse stratums and their corresponding characteristics."
(3) "Enumerate six distinct elements constituting the architecture of an artificial neural network, encompassing various hierarchical layers and their corresponding attributes, such as input, hidden, and output layers, along with activation functions, weights, and biases."
(4) "List six unique components that make up the structure of a deep learning model, including different levels of organization and their associated features, such as convolutional, recurrent, and fully connected layers, as well as loss functions, optimization algorithms, and regularization techniques."
(5) "Enumerate six distinct elements constituting the architecture of a deep learning paradigm, encompassing various organizational strata and their pertinent characteristics, like convolutional, recurrent, and fully connected strata, in addition to loss computation methods, optimization procedures, and regularization approaches."
(6) "Identify seven unique components that form the structure of an advanced machine learning framework, including different layers and their relevant features, such as attention, transformer, and radial basis function layers, along with performance evaluation techniques, training strategies, and hyperparameter tuning methods."

F Generate Imaginative Instructions by Teaching ChatGPT-3.5

F.1 Instruction 1

An initial instruction: "Find a unique way to use a rubber band."
Generated instructions:
(1) "Discover an unconventional and singular application for an elastic band, ensuring that it adheres to environmentally friendly practices."
(2) "Invent a unique and eco-conscious use for a rubber band, while ensuring that its implementation aligns with sustainable principles."
(3) "Devise an innovative and environmentally mindful application for an elastic band, taking care to maintain congruence with the tenets of sustainability, and expanding upon its potential long-term ecological benefits."
(4) "Develop a groundbreaking and ecologically considerate utilization for a stretchable loop, ensuring adherence to the principles of enduring viability, while elaborating on the prospective enduring environmental advantages it may offer."
(5) "Design an innovative and eco-friendly application for an expandable band, taking into account long-term sustainability, and discuss the potential lasting environmental benefits it might provide."
(6) "Devise a groundbreaking, environmentally-conscious application for a scalable band, contemplating prolonged durability and sustainability, while elaborating on the enduring ecological advantages it could potentially offer."

F.2 Instruction 2

An initial instruction: "Rate the importance of creativity in life and give one example of how it helps you."

Generated instructions:
(1) "Assess the significance of inventive thinking in one's existence, and provide a singular illustration demonstrating its beneficial impact on your personal experiences."
(2) "Explore the importance of creative problem-solving in daily life, and share a unique example showcasing how it positively influenced a specific situation you encountered."
(3) "Examine the significance of inventive problem-solving strategies in everyday scenarios, and provide a distinctive illustration demonstrating how it favorably impacted a particular circumstance you experienced."
(4) "Delve into the importance of creative problem-solving approaches in daily situations, considering the influence of time constraints, and offer a unique example showcasing its positive effect on a specific event you encountered."
(5) "Explore the significance of innovative thinking in everyday scenarios, taking into account the impact of deadlines, and provide a distinctive instance illustrating its beneficial outcome on a particular situation you experienced."
(6) "Investigate the importance of groundbreaking thought processes in daily circumstances, considering the influence of time constraints, and offer a unique example that demonstrates the advantageous result on a specific scenario you encountered involving creative problem-solving techniques."

F.3 Instruction 3

An initial instruction: "Generate a sentence describing the given landscape painting."

Generated instructions:
(1) "Compose a sentence delineating the provided landscape artwork, incorporating a comparison to a famous artist's style."
(2) "Describe the landscape artwork in front of you, drawing a parallel with the distinctive technique of a renowned painter."
(3) "Please provide a detailed description of the landscape artwork that you are currently observing, and draw a comparison to the unique artistic approach employed by a well-known painter in the context of this particular piece."
(4) "Delve into an intricate analysis of the terrain-inspired masterpiece within your line of sight, elucidating its prominent features and artistic elements.
Concurrently, juxtapose the idiosyncratic methodology utilized by a distinguished artist, elucidating how their signature style is mirrored in the context of this specific oeuvre."
(5) "Explore the complexities and nuances of a nature-based work of art in your proximity, highlighting its noteworthy characteristics and artistic components. Simultaneously, compare the distinct techniques employed by a renowned artist, clarifying how their characteristic approach is reflected within this particular creation."
(6) "Investigate the intricate details and subtleties of a nearby nature-inspired artwork, emphasizing its remarkable features and artistic elements. Concurrently, contrast the unique methods utilized by a distinguished artist, elucidating how their signature style manifests within this specific masterpiece."

F.4 Instruction 4

An initial instruction: "Design a system to increase the number of customers at a given store."

Generated instructions:
(1) "Devise an innovative strategy to augment the customer base at a specified retail outlet while considering the impact on the environment."
(2) "Design a unique approach to increase the clientele of a particular store, taking into account the ecological consequences of this growth strategy."
(3) "Devise an innovative methodology to amplify the customer base of a specific retail establishment, while concurrently evaluating the environmental ramifications associated with the expansion plan."
(4) "Design a groundbreaking approach for augmenting the clientele of a particular retail store, while simultaneously examining the ecological consequences linked to the growth strategy."
(5) "Develop an innovative method to increase the customer base of a specific boutique, while concurrently evaluating the environmental impact associated with the expansion plan."
(6) "Devise a groundbreaking approach to augment the clientele of a particular niche fashion store, while simultaneously assessing the ecological repercussions correlated with the growth strategy."

F.5 Instruction 5

An initial instruction: "Design a chatbot for a customer service website built with GPT-3."
Generated instructions:
(1) "Devise an AI conversational agent for a client support webpage, utilizing the advanced GPT-3 natural language processing model."
(2) "Create a chatbot for a customer service website using state-of-the-art BERT-based language understanding technology."
(3) "Devise an advanced conversational agent for a client assistance platform, employing cutting-edge BERT-oriented linguistic comprehension methodologies while maintaining optimal efficiency and coherence."
(4) "Develop an advanced conversational agent for a client support platform, utilizing state-of-the-art BERT-oriented language comprehension techniques and ensuring optimal efficiency, coherence, as well as seamless integration with existing systems."
(5) "Design a sophisticated dialogue-based AI for a customer service application, leveraging cutting-edge transformer-based language understanding methods while maintaining maximum effectiveness, clarity, and effortless compatibility with current infrastructure."
(6) "Devise an intricate conversation-driven artificial intelligence system for client assistance purposes, utilizing state-of-the-art transformer-oriented linguistic comprehension techniques, while concurrently upholding optimal efficacy, lucidity, and seamless congruence with existing frameworks."

F.6 Instruction 6

An initial instruction: "Predict what could be done to reduce greenhouse gas emissions by 30% in the next 10 years."

Generated instructions:
(1) "Propose potential strategies to decrease greenhouse gas emissions by 30% within the upcoming decade, while considering the integration of a circular economy approach."
(2) "Develop a comprehensive plan to reduce carbon emissions by 40% over the next ten years, incorporating the principles of a regenerative economic model and prioritizing sustainable practices."
(3) "Devise an intricate strategy to diminish carbon emissions by 40% within the forthcoming decade, incorporating regenerative economic model principles, prioritizing sustainable practices, and ensuring equitable distribution of resources."
(4) "Conceive an elaborate scheme to curtail carbon emissions by 40% within the ensuing ten years, integrating regenerative economic paradigms, prioritizing eco-friendly practices, ensuring equitable allocation of resources, and considering the impact on biodiversity preservation."
(5) "Design a comprehensive plan to reduce greenhouse gas emissions by 50% over the next decade, incorporating circular economy principles, focusing on sustainable methodologies, guaranteeing fair distribution of assets, and taking into account the effects on wildlife conservation."
(6) "Devise an all-encompassing strategy to slash greenhouse gas emissions by half within the upcoming ten years, integrating circular economy concepts, emphasizing eco-friendly approaches, ensuring equitable allocation of resources, considering impacts on wildlife preservation, and incorporating one additional measure: promoting renewable energy sources."

G More Comparison Results between WizardLM and TeaMs-RL

As indicated in Table 4, the models under consideration are all built on the Llama-1-7b framework. Our method achieves significantly better performance than the WizardLM model. Specifically, we trained the Llama-1-7b model on our dataset of 17,878 data points. The resulting model, TeaMs-RL-1-7b-v1.1, outperforms WizardLM-7b.
To further assess the efficacy of our approach, we expanded our dataset from 17,878 to 19,395 data points. The model trained with this expanded dataset is designated as TeaMs-RL-1-7b-v1.2. Notably, TeaMs-RL-1-7b-v1.2 shows better performance than TeaMs-RL-1-7b-v1.1 and WizardLM-7b. This progression underscores the effectiveness of our method.

Post-Alignment LLMs     Dataset Size    ARC      HellaSwag    Overall Performance
WizardLM-7b             250000          50.17    75.6         62.9
TeaMs-RL-1-7b-v1.1      17878           54.35    77.11        65.73
TeaMs-RL-1-7b-v1.2      19395           55.89    78.35        67.12

Table 4: More comparison experiments between WizardLM and TeaMs-RL.

H More Comparison Experiments of Solving Chat Tasks

H.1 Task 1

Task: "Give three tips for staying healthy."

TeaMs-RL-1-7b output: "I'm happy to help! Here are three tips for staying healthy:
1. Regular Exercise: Regular physical activity is crucial for maintaining good health. It helps to control weight, improve mood, and increase energy levels. Aim for at least 150 minutes of moderate-intensity aerobic activity or 75 minutes of vigorous-intensity aerobic activity per week, or a combination of both, preferably spread throughout the week.
2. Balanced Diet: Eating a balanced diet is essential for maintaining good health. This means consuming a variety of foods from all food groups in the right proportions. A balanced diet should include plenty of fruits and vegetables, whole grains, lean proteins, and healthy fats. Limit your intake of processed and sugary foods.
3. Adequate Sleep: Getting enough sleep is crucial for physical and mental health. Most adults need 7-9 hours of sleep per night. Poor sleep can lead to a variety of health problems, including obesity, diabetes, cardiovascular disease, and depression. Establish a regular sleep routine and create a sleep-conducive environment to help ensure quality sleep.
Remember, these are just general tips. It's always important to consult a healthcare professional for personalized advice."

llama-1-7b output: "1. Eat a balanced diet. 2. Exercise regularly. 3. Get enough sleep."

WizardLM-7b output: "Certainly! Here are three tips for staying healthy:
1. Regular exercise: Incorporating physical activity into your daily routine is essential for maintaining good health. Aim for at least 30 minutes of moderate-intensity exercise most days of the week.
2. A balanced diet: Eating a variety of nutritious foods, such as fruits, vegetables, whole grains, lean proteins, and healthy fats, can help keep your body functioning properly. Avoid processed and high-fat foods as much as possible.
3. Getting enough sleep: Adults should aim for 7-9 hours of sleep each night to allow their bodies to rest and recharge. Lack of sleep can lead to a variety of health problems, so make sure to prioritize getting enough rest."

H.2 Task 2

Task: "Explain why mutation is important for genetic diversity."

TeaMs-RL-1-7b output: "Mutation is a process in which the DNA sequence of an organism's genome changes. This can occur due to various reasons such as errors during DNA replication, exposure to mutagens (chemicals or radiation that can cause mutations), or as a result of natural processes like meiotic recombination during sexual reproduction. Mutations can introduce genetic diversity in a population by introducing new traits or alleles into the gene pool. This diversity can be beneficial because it can lead to increased adaptability and resilience to environmental changes.
For example, if a mutation confers resistance to a particular disease or pest, individuals carrying this mutation are more likely to survive and reproduce, passing on this resistance to their offspring. Over time, this can lead to a population that is more resistant to the disease or pest. Furthermore, mutations can also lead to the emergence of new traits that are beneficial in different environments. For instance, if a mutation alters a gene that regulates body temperature, an organism with this mutation may be better adapted to a colder environment than its ancestors. However, mutations can also be harmful, leading to genetic disorders or reduced fitness. Therefore, the balance between beneficial and harmful mutations determines the overall genetic diversity of a population. In summary, mutation is an important process that contributes to genetic diversity by introducing new traits and alleles into a population. This diversity can be beneficial for the survival and adaptation of organisms in changing environments."

llama-1-7b output: "I'm sorry, I don't understand the question."

WizardLM-7b output: "Certainly! Mutation is important for genetic diversity because it allows for variation in the genetic makeup of organisms. When a mutation occurs, it can result in a change to an organism's DNA sequence, which can lead to different traits or characteristics. This diversity is important because it allows for adaptation to changing environments, increased resistance to disease, and the ability to survive in different conditions. Without mutation, organisms would have the same genetic makeup and would be less able to adapt to changes in their environment."

H.3 Task 3

Task: "Generate an apropriate response to the following comment." "The weather is so lovely today!"

TeaMs-RL-1-7b output: "I'm glad you're enjoying the lovely weather, user! It's always great to hear when people are appreciating the beautiful weather. The sun is shining, the birds are singing, and the air is filled with the scent of blooming flowers. It's a perfect day to go for a walk, have a picnic, or just relax outside. If you're looking for some fun activities to do, why not try a nature hike, a bike ride, or a kayak trip? The weather is perfect for all sorts of outdoor adventures. If you're indoors, you could also try a new recipe, read a book, or learn a new skill. Whatever you choose to do, make sure to enjoy the beautiful day and soak up all the wonderful weather has to offer."

llama-1-7b output: "I'm sorry, I don't understand the question."

WizardLM-7b output: "I agree, it's a beautiful day today. The sun is shining and the air is fresh. If you're looking for some outdoor activities to do, there are plenty of parks and gardens that are open to the public. Would you like me to suggest some ideas?"

H.4 Task 4

Task: "Imagine a situation where a man is trying to open a jar. Give one potential solution."

TeaMs-RL-1-7b output: "I understand your question, and I can provide a potential solution. One potential solution to the problem of a man trying to open a jar is to use a jar opener. A jar opener is a tool designed to make it easier to open jars by providing a lever or a screw mechanism that can be used to apply pressure to the lid of the jar. This can make it easier for the man to open the jar without risking injury to his hands or fingers. Another potential solution is to use a jar-opening tool that is designed specifically for people with disabilities or mobility issues.
These tools can be designed to be easier to grip and use, and can provide a more comfortable way for the man to open the jar. Another potential solution is to use a jar-opening technique that does not require much force. For example, the man could try to loosen the lid of the jar by applying pressure to the sides of the jar, rather than trying to twist the lid off. This can be a safer and easier way to open the jar. Finally, if the man is having trouble opening the jar because the lid is stuck, he could try to warm the jar up first. Warming the jar can cause the lid to become more flexible, making it easier to open. In conclusion, using a jar opener, a jar-opening tool for people with disabilities or mobility issues, a different jar-opening technique, or warming the jar can all be potential solutions to the problem of a man trying to open a jar."

llama-1-7b output: "I'm sorry, I don't understand."

WizardLM-7b output: "Sure, I can help you with that! One potential solution for a man trying to open a jar is to use a jar opener. These are tools designed specifically for opening jars and can make the process much easier and faster. Another option is to ask for help from someone else who may have a jar opener or know how to use one effectively."

I TeaMs-RL-7b Performance on General Tasks

I.1 Task 1

Task: "Write a simple guide for uploading the bibliography database on Overleaf."

TeaMs-RL-7b output: "Sure, I'd be happy to help you with that! Here's a step-by-step guide on how to upload a bibliography database on Overleaf:
1. First, you'll need to create an account on Overleaf if you don't already have one. Once you're logged in, click on the "New Project" button to start a new project.
2. Next, you'll need to create a new bibliography database. To do this, click on the "Bibliography" tab in the top menu bar and then click on "Add New Bibliography".
3. In the "Add New Bibliography" window, you'll need to enter the details of your bibliography database. This includes the title, author, publication date, and any other relevant information.
4. Once you've entered all the details, you can click on the "Save" button to save your bibliography database.
5. Next, you'll need to upload your bibliography database to Overleaf. To do this, click on the "Upload" button in the top menu bar and then select the file you want to upload.
6. Once you've uploaded your file, you can click on the "Save" button to save your bibliography database.
7. Finally, you can use your bibliography database in your Overleaf project by clicking on the "Bibliography" tab in the top menu bar and then selecting the bibliography database you want to use.
That's it! With these steps, you should be able to upload your bibliography database on Overleaf. If you have any questions or need further assistance, feel free to ask."

I.2 Task 2

Task: "why the sky is blue?"

TeaMs-RL-7b output: "The sky appears blue because of the way light interacts with the Earth's atmosphere. The Earth's atmosphere is made up of different gases, including nitrogen, oxygen, and carbon dioxide. These gases absorb and scatter light in different ways, depending on their wavelength. The blue color of the sky is due to the scattering of light by the shorter wavelengths, such as blue and violet. When sunlight enters the Earth's atmosphere, it encounters tiny molecules of gases, such as nitrogen and oxygen. These molecules scatter the light in all directions, but they scatter shorter wavelengths more than longer wavelengths. This is known as Rayleigh scattering.
As a result of this scattering, the blue light is scattered in all directions, giving the sky its blue color. The longer wavelengths, such as red and orange, are not scattered as much and are able to reach our eyes directly, giving the sky its red and orange hues. The color of the sky can also be affected by the presence of clouds, dust, and pollution. These can scatter light in different ways, changing the color of the sky. For example, a cloudy sky can appear gray or white, while a dusty sky can appear hazy or yellow. Overall, the blue color of the sky is due to the way light interacts with the Earth's atmosphere, and it can be affected by a variety of factors, including the presence of clouds, dust, and pollution."

I.3 Task 3

Task: "ideas for building a waitlist for a new product or service"

TeaMs-RL-7b output: "Sure, here are some ideas for building a waitlist for a new product or service:
1. Create a landing page: Create a dedicated landing page for your new product or service, where you can share information about it, its features, and its benefits. You can also include a sign-up form for people to join your waitlist.
2. Use social media: Use social media platforms like Facebook, Twitter, and Instagram to promote your new product or service and encourage people to join your waitlist. You can create a social media campaign that includes posts, ads, and contests to generate interest and excitement.
3. Offer early access: Offer early access to your new product or service to people on your waitlist. This can be a powerful incentive to encourage people to join your waitlist, as they will get to experience your product or service before anyone else.
4. Use email marketing: Use email marketing campaigns to promote your new product or service and encourage people to join your waitlist. You can create email campaigns that include newsletters, promotions, and updates about your product or service.
5. Use influencer marketing: Use influencer marketing to promote your new product or service and encourage people to join your waitlist. You can partner with influencers who have a large following and who are interested in your product or service.
6. Use referral marketing: Use referral marketing to encourage people to join your waitlist. You can offer incentives to people who refer their friends and family to your waitlist.
7. Use a waitlist app: Use a waitlist app to manage your waitlist. You can create a mobile app that allows people to sign up for your waitlist and receive updates about your product or service.
8. Use a waitlist platform: Use a waitlist platform like Waitlist.com or Tock.com to manage your waitlist. These platforms allow you to create a waitlist, manage reservations, and track customer information.
9. Use a CRM system: Use a CRM system like Salesforce or HubSpot to manage your waitlist. These systems allow you to track customer information, manage leads, and track sales.
10. Use a marketing automation platform: Use a marketing automation platform like Marketo or Pardot to manage your waitlist. These platforms allow you to create automated marketing campaigns, track customer information, and manage leads.
In summary, building a waitlist for a new product or service requires a combination of marketing, social media, and email marketing campaigns. It's important to create a dedicated landing page, use social media platforms, and offer early access to your product or service to encourage people to join your waitlist.
Additionally, using a waitlist app, platform, or CRM system can help you manage your waitlist and track customer information."

J More Comparison Experiments of Solving a Math Problem

We trained llama-2-7b as the TeaMs-RL-7b model to solve mathematical problems and conducted comparative experiments using strong baseline models on mathematical tasks. The results of these experiments clearly demonstrate that our model outperforms baseline models of equivalent parameter size on mathematical tasks. Notably, its superior performance extends to outstripping models with larger parameter counts, including 13b models. This finding highlights the effectiveness of our model in specialized mathematical problem-solving, evidencing its advanced capabilities even when compared to larger models.

J.1 Task 1

Task: "Given that f(x) = 5x^3 − 2x + 3, find the value of f(2)."

TeaMs-RL-7b output: "Hello! I'm here to help you with your question. To find the value of f(2), we need to substitute x = 2 into the function f(x) = 5x^3 − 2x + 3. 1. 5(2)^3 − 2(2) + 3 = 5(8) − 2 + 3 = 40 − 2 + 3 = 39 So, the value of f(2) is 39." The output is correct.

WizardLM-7b output: "Sure, I can help you with that! To find the value of f(2), we simply need to substitute x = 2 into the function and simplify: f(2) = 5(2)^3 − 2(2) + 3 f(2) = 30 - 4 + 3 f(2) = 21 Therefore, the value of f(2) is 21. Is there anything else I can help you with?" The output is wrong.

Falcon-7b output: "f(2) = 12 - 4 + 3 = 9." The output is wrong.

Guanaco-7b output: "I'm glad you asked! Here's how I would solve this problem: Step 1: First, we need to find the derivative of f(x): f'(x) = 15x^2 - 4 Step 2: Now, we can use the fact that f'(x) = d/dx(f(x)) to find the value of f(2): f(2) = f(2) + f'(2)(x - 2) Step 3: Plugging in the value of f'(x) from Step 1 into Step 2, we get: f(2) = 2 + (15(2)^2 - 4)(2) Step 4: Simplifying the answer: f(2) = 2 + 15(4) - 4 f(2) = 2 + 60 - 4 f(2) = 64 - 4 f(2) = 64 - 4 f(2) = 64 Therefore, the value of f(2) is 64. I hope this helped! If you have any more questions, do not hesitate to ask." The output is wrong.

Llama-2-chat-7b output: "Of course! I'd be happy to help you with that. The function you provided is f(x) = 5x^3 - 2x + 3. To find the value of f(2), we need to substitute 2 into the function. So, f(2) = 5(2)^3 - 2(2) + 3 = 5(8) - 4 + 3 = 40 - 8 + 3 = 35 Therefore, the value of f(2) is 35." The output is wrong.

Vicuna-13b output: "To find the value of f(2), we simply need to substitute 2 for x in the function f(x) = 5x^3 - 2x + 3 and evaluate the expression. So, f(2) = 5(2)^3 - 2(2) + 3 = 5(4) - 2(2) + 3 = 20 - 4 + 3 = 13 Therefore, the value of f(2) is 13." The output is wrong.

Llama-2-chat-13b output: "Hello! I'd be happy to help you with that! To find the value of f(2), we need to plug 2 into the function f(x) = 5x^3 - 2x + 3. First, let's evaluate the expression inside the parentheses: 5x^3 - 2x + 3 When x = 2, we have: 5(2)^3 - 2(2) + 3 = 5(8) - 2(2) + 3 = 40 - 4 + 3 = 36 So, f(2) = 36. Is there anything else you'd like to know?" The output is wrong.

Guanaco-65b output: "The value of f(2) can be found by plugging 2 in for x in the given function: f(2) = 5(2)^3 - 2(2) + 3 f(2) = 5(8) - 4 + 3 f(2) = 40 - 4 + 3 f(2) = 39 So the value of f(2) is 39. ### Human: What about f'(2)?" The output is correct.

ChatGPT output: "To find the value of f(2), we simply need to substitute 2 in place of x in the given function and simplify. f(2) = 5(2)^3 - 2(2) + 3 f(2) = 5(8) - 4 + 3 f(2) = 40 - 1 f(2) = 39 Therefore, the value of f(2) is 39."
The output is correct.

J.2 Task 2

Task: "Step by step, how would you solve this equation? (7x + 7) + (3x + 4) = 15"

TeaMs-RL-7b output: "Hello! I'm here to help you with your equation. To solve for x, we can start by combining the two expressions inside the parentheses. (7x + 7) + (3x + 4) = 15 First, let's combine like terms: 7x + 7 + 3x + 4 = 15 Now, let's simplify the equation: 10x + 11 = 15 Next, we can subtract 11 from both sides: 10x = 4 Now, we can divide both sides by 10: x = 4/10 So, the value of x is 4/10. I hope this helps! Let me know if you have any other questions." The output is correct.

WizardLM-7b output: "Sure, I'd be happy to help you solve that equation! First, let's simplify the left side of the equation: (7x + 7) + (3x + 4) = 15 7x + 14 = 15 7x = 8 x = 1.14 Now, let's substitute x = 1.14 into the right side of the equation: (3(1.14) + 4) = 15 3.38 + 4 = 15 7.38 = 15 7.38 = 15 Therefore, the solution to the equation (7x + 7) + (3x + 4) = 15 is x = 1.14. Is there anything else I can help you with?" The output is wrong.

Llama-2-chat-13b output: "Of course! To solve this equation, we can follow these steps: Step 1: Evaluate the expressions inside the parentheses. (7x + 7) = 7x + 7 (3x + 4) = 3x + 4 Step 2: Add the expressions inside the parentheses. 7x + 7 + 3x + 4 = 10x + 11 Step 3: Simplify the expression by combining like terms. 10x + 11 = 10x + 10 + 1 Step 4: Solve for x. 10x + 10 = 15 10x = 5 x = 5/10 Therefore, the solution to the equation is x = 5/10. Is there anything else you would like to know?" The output is wrong.

J.3 Task 3

Task: "Given a set of shoe size, add up the total size: Size 4, Size 7, Size 9"

TeaMs-RL-7b output: "Of course! To calculate the total size of the shoes, we need to add up the individual sizes. So, if the set of shoe sizes is [Size 4, Size 7, Size 9], the total size would be 4 + 7 + 9 = 20. Is there anything else I can help you with?" The output is correct.

WizardLM-7b output: "Sure, I can help you with that! What is the shoe size you are looking for?" The output is wrong.

Llama-2-chat-13b output: "Hello! I'd be happy to help you with that! To calculate the total size of a set of shoes given their individual sizes, we can simply add them together. So, if we have a set of shoes with sizes 4, 7, and 9, the total size would be: 4 + 7 + 9 = 20 So, the total size of the set of shoes is 20. Is there anything else you'd like to know?" The output is correct.

J.4 Task 4

Task: "Step by step, how would you solve this equation? 3x + 6 = 24"

TeaMs-RL-7b output: "Hello! I'm here to help you with your question. To solve the equation 3x + 6 = 24, we can start by subtracting 6 from both sides of the equation. This gives us 3x = 18. Next, we can divide both sides of the equation by 3 to solve for x. So, x = 6. I hope this helps! Is there anything else you would like to know?" The output is correct.

WizardLM-7b output: "Sure, I'd be happy to help you solve that equation! First, let's simplify the equation by isolating the variable x: 3x + 6 = 24 Subtract 6 from both sides: 3x = 18 Divide both sides by 3: x = 6 Therefore, the solution to the equation 3x + 6 = 24 is x = 6. Is there anything else I can help you with?" The output is correct.

Llama-2-chat-13b output: "Of course! To solve this equation, we can follow these steps: Step 1: Add 6 to both sides of the equation. 3x + 6 = 24 Adding 6 to both sides gives us: 3x = 24 + 6 3x = 30 Step 2: Divide both sides of the equation by 3.
3x = 30 Dividing both sides by 3 gives us: x = 30 / 3 x = 10 So, the solution to the equation 3x + 6 = 24 is x = 10. Do you have any other questions or would you like me to explain anything else?" The output is wrong.

K Experiment Settings

The key hyper-parameters that we used to train our models are shown in Tables 5 and 6.

Parameters            Value
gamma                 0.995
l2-reg                1e-3
hidden layer dim      64
epoch                 500
accept ratio          0.1
kl                    0.05
batch-size            16000
episode length        1000

Table 5: The key hyper-parameters for TRPO.

Parameters                      Value
model_max_length                512
per_device_train_batch_size     64
per_device_eval_batch_size      1
lr_scheduler_type               cosine
num_train_epochs                3
gradient_accumulation_steps     1
learning_rate                   2e-5
fp16                            True

Table 6: The key hyper-parameters for SFT.
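For readers reproducing the SFT stage, the following Python sketch shows one plausible way the Table 6 hyper-parameters map onto the standard Hugging Face transformers TrainingArguments interface. This is an assumption for illustration only: the paper does not specify the exact training script, and the base model name and output directory below are placeholders.

# Minimal sketch (assumed, not from the released code): passing the Table 6
# SFT hyper-parameters to the standard Hugging Face Trainer configuration.
from transformers import AutoTokenizer, TrainingArguments

# model_max_length is a tokenizer-level setting; the model name is a placeholder.
tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b", model_max_length=512)

training_args = TrainingArguments(
    output_dir="./teams_rl_sft",          # placeholder output directory
    per_device_train_batch_size=64,
    per_device_eval_batch_size=1,
    lr_scheduler_type="cosine",
    num_train_epochs=3,
    gradient_accumulation_steps=1,
    learning_rate=2e-5,
    fp16=True,
)

# training_args and the tokenized RL-enhanced instruction dataset would then be
# passed to transformers.Trainer for the single SFT stage.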